Causal Inference and Machine Learning

Benjamin J. Radford

Introduction

Causal Inference and Machine Learning.

  1. Go to: www.github.com/benradford/double-ml-workshop
  2. In upper right, click Code and then Download Zip.
  3. Unzip the folder and open double-ml.qmd in RStudio.

This Talk

  • Introduction

    • Inference

    • Causality

    • Causal Inference

    • Machine Learning

  • Double Machine Learning / Debiased Machine Learning

  • Your Turn!

  • Conclusion

Inference

  • Inference means that we are attempting to learn about (i.e., estimate) the value of a population parameter that we can’t observe.

  • For example, we might use OLS to compute \(\hat{\beta}\), our estimate of the population parameter \(\beta\), where \(\beta\) describes the relationship (i.e., slope) between \(X\) and \(Y\).
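As a quick sketch of this idea (simulated data, not part of the workshop files), OLS inference looks like:

```r
# Simulate a known population relationship: y = 2 + 0.5 * x + noise.
set.seed(42)
x <- rnorm(1000)
y <- 2 + 0.5 * x + rnorm(1000)

# beta-hat is our estimate of the unobserved population slope (0.5 here).
fit <- lm(y ~ x)
coef(fit)["x"]
```

Because the true slope is known by construction, you can check that the estimate lands near 0.5.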

Causality

  • Causality is not the same as inference.

  • We can infer values that are non-causal.

  • Causality is all about your causal identification strategy.

  • Your identification strategy is how you justify making causal claims given your research design.

  • What does it even mean to “cause” something?

Causal Inference

  • Causal inference is the combination of an identification strategy and statistical inference.

  • It requires:

    • Convincing your reader that your estimand matches your theory.

    • Convincing your reader that your estimator captures your estimand.

    • Convincing your reader that the relationship represented by your estimand is causal.

    • Convincing your reader that your estimate is not subject to confounding or collider bias.

Causal Inference Summarized

  • Inference: this is the statistical task of using a sample to learn about a population.

  • Causal inference: this is the research design task of convincing readers that the parameters you’re making inferences about represent causal relationships and not simply spurious correlations.

Machine Learning

  • The goal of machine learning is to learn (estimate) a function of interest given the data that go into and come out of the function.

  • Think of functions like we do in mathematics: \(f(x) \rightarrow y\)

  • \(x\) goes in and \(y\) comes out.

  • In most math courses (e.g., algebra), we know \(x\) and we know \(f(\cdot)\); we’re solving for \(y\).

  • In machine learning, we know \(x\) and we know \(y\); we’re solving for \(f(\cdot)\).

Why Machine Learning

We use machine learning for two primary purposes:

  1. We want to be able to predict new \(y\) values given new \(x\) values.
  2. We are interested in the properties of \(f(\cdot)\) itself (“inference”).

Simple Machine Learning

  1. Is OLS Linear Regression machine learning?
  2. Yes!
  3. We assume \(Y = \alpha + X\beta + u\) and want to learn \(\alpha, \beta\) using known \(X\) and \(Y\).

OLS for Inference

  • OLS is great for inference:

    • \(\beta\) is easy to interpret.

    • “A one unit increase in \(x_i\) corresponds to an expected \(\beta_i\) increase in \(y\), holding all other \(x_j\) constant.”

  • OLS is not great for prediction:

    • Often, no reason to expect constant linear relationships between \(X\) and \(Y\).

    • OLS chokes on problems with many independent variables.

    • OLS assumes a very strict functional form and interactions or transformations must be input manually.
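For example, any curvature or interaction has to be written into the formula by hand (an illustrative snippet reusing the ice cream variables, not code from the slides):

```r
data <- read.csv("data/ice_cream_sales.csv")

# OLS only captures the nonlinearity and interactions we specify explicitly:
# a squared price term via I() and a price-by-weekday interaction via ":".
fit <- lm(sales ~ price + I(price^2) + price:weekday + weekday, data = data)
```

Nothing in OLS discovers these terms for us; if we omit them, the model simply assumes they don’t exist.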

So Are There Other ML Options?

Why Not Just Use OLS?

  • We often don’t know how \(X\) and \(Y\) are related.

  • Most ML methods make weaker assumptions about \(f(\cdot)\).

  • Many ML models can learn transformations, variable selection, and interactions from the data!

  • But, this comes at a cost:

    • Complicated (or even unknown) functional forms.

    • No standard errors, which we need for inference.

      • See: conformal prediction.
  • Often ML is great for prediction, bad for inference.

Back to OLS

OLS Linear Regression

\(y = \alpha + \beta x + \gamma_{1}z_{1} + \ldots + \gamma_k z_k + u\)

  • \(y\): dependent variable (outcome)

  • \(x\): independent variable of interest (treatment)

  • \(z_j\): control variables / confounders (not interesting)

Back to OLS

OLS Linear Regression

\(y = \alpha + \beta x + \gamma_{1}z_{1} + \ldots + \gamma_k z_k + u\)

  • Can we have the best of both worlds?

    • What if we could assume a linear relationship between \(x\) and \(y\)

    • …and make fewer assumptions about the relationship between \(z\) and \(y\)?

Double Machine Learning

Double Machine Learning

  • In double machine learning, we can control for many confounders \((z)\), even ones with non-linear effects, using any machine learning model.

  • Then, we can use a standard regression model to estimate the effect of \(x\) on \(y\).

  • Predictive performance and flexibility of machine learning!

  • Interpretability and uncertainty of traditional statistical inference!

But Let’s Start with OLS

  1. Load ice cream data.
  2. We’re interested primarily in sales \((y)\), price \((x)\), and day of week \((z)\).
data <- read.csv("data/ice_cream_sales.csv")
head(data)
  temp weekday cost price sales
1 17.3       6  1.5   5.6   173
2 25.4       3  0.3   4.9   196
3 23.3       5  1.5   7.6   207
4 26.9       1  0.3   5.3   241
5 20.2       1  1.0   7.2   227
6 26.1       6  0.5   6.6   193

Always Plot Your Data

Let’s Model It


Call:
lm(formula = sales ~ price + weekday, data = data)

Residuals:
    Min      1Q  Median      3Q     Max 
-61.943 -13.669  -2.966  11.574  60.076 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 192.5306     1.0815 178.019  < 2e-16 ***
price         1.2285     0.1623   7.570 4.06e-14 ***
weekday       0.1111     0.0960   1.158    0.247    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 19.37 on 9997 degrees of freedom
Multiple R-squared:  0.00584,   Adjusted R-squared:  0.005641 
F-statistic: 29.36 on 2 and 9997 DF,  p-value: 1.928e-13

What’s Weird Here?

A Note on Nonlinearity

Now Let’s Try Another Method of OLS

y_on_z <- lm(sales ~ weekday, data=data)
x_on_z <- lm(price ~ weekday, data=data)
y_on_x <- lm(y_on_z$resid ~ x_on_z$resid)
summary(y_on_x)

Call:
lm(formula = y_on_z$resid ~ x_on_z$resid)

Residuals:
    Min      1Q  Median      3Q     Max 
-61.943 -13.669  -2.966  11.574  60.076 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  1.936e-15  1.937e-01    0.00        1    
x_on_z$resid 1.229e+00  1.623e-01    7.57 4.05e-14 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 19.37 on 9998 degrees of freedom
Multiple R-squared:  0.0057,    Adjusted R-squared:  0.0056 
F-statistic: 57.31 on 1 and 9998 DF,  p-value: 4.049e-14

Notice Anything Neat?

Standard Linear Regression

              Estimate Std. Error    t value     Pr(>|t|)
(Intercept) 192.530636 1.08151586 178.019245 0.000000e+00
price         1.228518 0.16228712   7.570028 4.061027e-14
weekday       0.111129 0.09599697   1.157630 2.470427e-01

Frisch-Waugh-Lovell

                 Estimate Std. Error      t value     Pr(>|t|)
(Intercept)  1.936429e-15  0.1936713 9.998533e-15 1.000000e+00
x_on_z$resid 1.228518e+00  0.1622790 7.570407e+00 4.049241e-14

They’re Exactly the Same!

Original OLS:

price         1.228518 0.16228712 7.570028 4.061027e-14

Frisch-Waugh-Lovell regression:

x_on_z$resid  1.228518 0.1622790  7.570407 4.049241e-14

Frisch-Waugh-Lovell Regression

To estimate the treatment effect of \(x\) on \(y\) in the presence of confounders \(z\), we could:

  1. Use a single OLS model
  2. Use FWL:
    1. Regress \(y\) on \(z\): estimate \(y = \alpha_1 + \beta_1 z + u_y\) and get the residuals \(u_y\)
    2. Regress \(x\) on \(z\): estimate \(x = \alpha_2 + \beta_2 z + u_x\) and get the residuals \(u_x\)
    3. Regress \(u_y\) on \(u_x\): \(u_y = \alpha + \beta u_x + \varepsilon\)
  3. The coefficient \(\hat{\beta}\) for price in OLS is the same as the coefficient for the residualized price in FWL!
  4. Wow!

Let’s Visualize This

Now, Replace Sales with Residuals

Replace Price with Residuals

Now Plot FWL Regression

How Does FWL Regression Help Us?

  • We don’t have to use OLS for the first (or second) stage!

  • We can use any method we like to control for the confounding variables.

  • This includes methods that are much more flexible than OLS.

Double ML with a Random Forest

You may need to run the following code in your Console to install the randomForest package.

install.packages("randomForest")

Load the randomForest package:

library("randomForest")

Get the Sales Residuals

First, let’s regress \(y\) (sales) on \(z\) (weekday), our confounder.

rf_sales_on_weekday <- randomForest(sales ~ weekday, data=data)
rf_sales_on_weekday_resid <- rf_sales_on_weekday$y - rf_sales_on_weekday$predicted

Get the Price Residuals

rf_price_on_weekday <- randomForest(price ~ weekday, data=data)
rf_price_on_weekday_resid <- rf_price_on_weekday$y - rf_price_on_weekday$predicted

Plot the Residuals
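A base-graphics version of this plot might look like the following (a sketch; the slide’s actual plotting code isn’t shown), using the residual vectors computed on the previous slides:

```r
# Scatter the price residuals against the sales residuals from the
# random forest first stages, with the fitted OLS line overlaid.
plot(rf_price_on_weekday_resid, rf_sales_on_weekday_resid,
     xlab = "Price residuals", ylab = "Sales residuals",
     pch = 16, col = rgb(0, 0, 0, 0.2))
abline(lm(rf_sales_on_weekday_resid ~ rf_price_on_weekday_resid), col = "red")
```

The downward slope of the overlaid line previews the negative price effect estimated on the next slide.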

Estimate the ATE

ols_ice_cream <- lm(rf_sales_on_weekday_resid ~ rf_price_on_weekday_resid)

Check Out the Model


Call:
lm(formula = rf_sales_on_weekday_resid ~ rf_price_on_weekday_resid)

Residuals:
    Min      1Q  Median      3Q     Max 
-60.385  -8.754   0.257   8.887  57.115 

Coefficients:
                          Estimate Std. Error t value Pr(>|t|)    
(Intercept)                0.00499    0.13320   0.037     0.97    
rf_price_on_weekday_resid -3.45619    0.12004 -28.792   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 13.32 on 9998 degrees of freedom
Multiple R-squared:  0.07657,   Adjusted R-squared:  0.07647 
F-statistic:   829 on 1 and 9998 DF,  p-value: < 2.2e-16

Your Turn

Load the Data

sim_data <- read.csv("data/simulated_data.csv")
summary(sim_data)
       x                 y                z          
 Min.   :-1.2236   Min.   : 5.821   Min.   :-5.9762  
 1st Qu.: 0.6654   1st Qu.: 7.994   1st Qu.:-1.5420  
 Median : 1.1981   Median : 9.635   Median :-0.1643  
 Mean   : 1.1889   Mean   :11.955   Mean   :-0.1116  
 3rd Qu.: 1.6725   3rd Qu.:13.722   3rd Qu.: 1.3245  
 Max.   : 3.5069   Max.   :45.408   Max.   : 6.1052  

Visualize the Data

Estimate the Standard OLS Model:

Estimate an OLS linear regression of \(y = \alpha + \beta_1 x + \beta_2 z + u\).

Summarize Your Model:

What is the estimated effect of \(x\) on \(y\)?

Begin Double ML

Estimate a random forest model regressing \(y\) on \(z\) (i.e., y ~ z). Compute the residuals and store them in a vector called y_on_z_resid.

Continue Double ML

Estimate a random forest model regressing \(x\) on \(z\) (i.e., x ~ z). Compute the residuals and store them in a vector called x_on_z_resid.

Finish your Double ML Estimator

Estimate a linear model of y_on_z_resid ~ x_on_z_resid using the lm(...) function.

Conclusion

What Can Go Wrong?

  • You could overfit:

    • Machine learning algorithms can sometimes fit the data too well.

    • This would cause your residuals to have low variance.

    • This could cause you to underestimate your effect or standard errors.

  • The solution is to use k-fold cross-prediction (also known as cross-fitting):

    • Estimate your ML algorithms on partitions of the data.

    • Then, predict values out-of-sample to use in your third stage.
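A minimal two-fold version of this cross-prediction idea, continuing the ice cream example (an illustrative sketch, not code from the workshop):

```r
library("randomForest")
data <- read.csv("data/ice_cream_sales.csv")

# Randomly assign each row to one of two folds.
set.seed(1)
fold <- sample(rep(1:2, length.out = nrow(data)))

y_resid <- numeric(nrow(data))
x_resid <- numeric(nrow(data))
for (k in 1:2) {
  train <- data[fold != k, ]
  test  <- data[fold == k, ]
  # Fit the first-stage models on one fold...
  rf_y <- randomForest(sales ~ weekday, data = train)
  rf_x <- randomForest(price ~ weekday, data = train)
  # ...and compute residuals out-of-sample on the held-out fold,
  # so overfitting in the first stage can't shrink the residuals.
  y_resid[fold == k] <- test$sales - predict(rf_y, newdata = test)
  x_resid[fold == k] <- test$price - predict(rf_x, newdata = test)
}

# Third stage: OLS on the out-of-sample residuals.
summary(lm(y_resid ~ x_resid))
```

Because every residual is predicted out-of-sample, the third-stage standard errors are not deflated by first-stage overfitting.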

Conclusion

  • The Frisch-Waugh-Lovell Theorem says:

    • We can estimate the linear effect of \(x\) on \(y\) in the presence of confounders \(z\) using three separate equations.
  • Double Machine Learning says:

    • We can use any (potentially powerful) ML estimator in the first two stages of FWL regression.
  • Here, we get the benefits of flexible ML algorithms for the control variables and the interpretability of OLS for the treatment variable.

References